class: middle, center # Biostatistics for Fluid Biomarkers Michael Donohue, PhD University of Southern California ### Biomarkers in Neurodegenerative Disorders University of Gothenburg May 26, 2021 .pull-left[ <img src="data:image/png;base64,#./images/atri.png" width="57%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="data:image/png;base64,#./images/actc_logo.png" width="47%" style="display: block; margin: auto;" /> ] --- # Course Overview .large[ Topics: - 9:00 - 9:50 -- Biostatistics for Fluid Biomarkers - 10:00 - 10:50 -- Biostatistics for Imaging Biomarkers - 11:00 - 11:50 -- Modeling Longitudinal Data Emphases: - Visualization - Demonstrations using R, code available from: - [https://github.com/atrihub/biomarkers-neuro-disorders-2021](https://github.com/atrihub/biomarkers-neuro-disorders-2021) ] --- # Session 1 Outline .large[ - Batch Effects - Experimental Design (Sample Randomization) - Statistical Models for Assay Calibration/Quantification - Classification (Supervised Learning) - Logistic Regression - Binary Trees - Random Forest - Mixture Modeling (Unsupervised Learning) - Univariate - Bivariate ] --- class: inverse, middle, center # Batch Effects --- # Batch Effects: Boxplot <img src="data:image/png;base64,#fluid_fig/batch_data_plot-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Coefficient of Variation .pull-left[ <table class="table table-striped table-condensed" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> batch </th> <th style="text-align:right;"> N </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> SD </th> <th style="text-align:right;"> SD/Mean = CV (%) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 790 </td> <td style="text-align:right;"> 379 </td> <td style="text-align:right;"> 48 </td> </tr> <tr> 
<td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 925 </td> <td style="text-align:right;"> 299 </td> <td style="text-align:right;"> 32 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 725 </td> <td style="text-align:right;"> 389 </td> <td style="text-align:right;"> 54 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 951 </td> <td style="text-align:right;"> 332 </td> <td style="text-align:right;"> 35 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 690 </td> <td style="text-align:right;"> 312 </td> <td style="text-align:right;"> 45 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 867 </td> <td style="text-align:right;"> 349 </td> <td style="text-align:right;"> 40 </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 837 </td> <td style="text-align:right;"> 446 </td> <td style="text-align:right;"> 53 </td> </tr> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 914 </td> <td style="text-align:right;"> 348 </td> <td style="text-align:right;"> 38 </td> </tr> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 883 </td> <td style="text-align:right;"> 271 </td> <td style="text-align:right;"> 31 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 763 </td> <td style="text-align:right;"> 266 </td> <td style="text-align:right;"> 35 </td> </tr> </tbody> </table> ] .pull-right[ - Coefficient of Variation (CV) = SD/Mean - Often used for quality 
control (reject batch with CV > `\(x\)`)
]

---

# Testing for Batch Effects

```r
anova(lm(Biomarker ~ batch, batch_data))

Analysis of Variance Table

Response: Biomarker
           Df   Sum Sq Mean Sq F value  Pr(>F)
batch       9  3573109  397012    3.37 0.00051 ***
Residuals 490 57758046  117874
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

* Batch explains a significant amount of the variation in this simulated data
* R note: the `batch` variable must be a `factor`, not `numeric` (otherwise, you will get a batch slope)

---

# Batch effects: Confounds

<img src="data:image/png;base64,#fluid_fig/batch_confounds-1.svg" width="100%" style="display: block; margin: auto;" />

???

Suppose we have groups of interest (say, active vs placebo) that we would like to compare. Do we see a problem here?

---

class: inverse, middle, center

# Experimental Design for Fluid Biomarkers

---

# Randomized assignment of samples to plates

<img src="data:image/png;base64,#fluid_fig/batch_randomized-1.svg" width="100%" style="display: block; margin: auto;" />

???

If we have both groups represented in each batch, we can disentangle batch effects and group effects. One way to ensure this is to randomize samples to batches.

---

# Experimental Design for Fluid Biomarkers

.large[
- Randomize samples to batches/plates
- Longitudinally collected samples (samples collected over time on the same individual):
  - If batch effects are expected to be larger than storage effects, consider randomizing *individuals* to batches
  - (Keep all samples from an individual on the same plate)
- Randomization can be stratified to ensure important factors (e.g. treatment group, age, APOE `\(\epsilon4\)`) are balanced
]

---

# Sample Randomization

We use an `R` package [SRS](https://github.com/atrihub/SRS) ("Subject Randomization System"), which we have modified to deal with the constraints of plate capacity and keeping samples from the same subject together.
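A stratified assignment can be sketched in base `R` (a hypothetical illustration only — this is not the `SRS` API, and the variable names are made up):

```r
# Hypothetical sketch: randomize 26 subjects to 13 plates,
# stratified by age group so each plate gets one young and one old subject.
set.seed(3)
subjects <- data.frame(id = 1:26, age_grp = rep(c("young", "old"), 13))
n_plates <- 13

assign_stratum <- function(ids) {
  # cycle through the plates, then shuffle, within each stratum
  sample(rep_len(seq_len(n_plates), length(ids)))
}

subjects$plate <- ave(subjects$id, subjects$age_grp, FUN = assign_stratum)
table(subjects$age_grp, subjects$plate)  # balance check: all cells equal 1
```

In practice, plate capacity and a varying number of samples per subject make the problem harder, which is what the modified `SRS` package handles.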
(Note this is different than the `SRS` package on CRAN) <table class="table table-striped table-condensed" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Subject ID </th> <th style="text-align:left;"> Num. of samples </th> <th style="text-align:left;"> Group </th> <th style="text-align:left;"> Age </th> <th style="text-align:left;"> Plate </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 11 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 6 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> young </td> <td style="text-align:left;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 8 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> young </td> <td style="text-align:left;"> 10 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> young </td> <td style="text-align:left;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> 4 </td> <td 
style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> young </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 12 </td> </tr> </tbody> </table> --- # Sample Randomization .pull-left[ <table class="table table-striped table-condensed" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;"> Plate </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 11 </td> <td style="text-align:left;"> 12 </td> <td style="text-align:left;"> 13 </td> </tr> <tr> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 3 </td> </tr> <tr> <td style="text-align:left;"> young </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td 
style="text-align:left;"> 4 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Num. samples </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 30 </td> </tr> </tbody> </table> ] .pull-right[ - Number of young and old well balanced across the 13 plates - Number of samples per plate is also reasonable (plate capacity was set at 30 samples) ] --- class: inverse, middle, center # Calibration --- # Calibration .large[ - Calibration: developing a map from "raw" assay responses to concentrations (ng/ml) using samples of *known* concentrations - We will explore some approaches to calibration with methods from the `R` package `calibFit` - The package includes some example data: - High Performance Liquid Chromatography (HPLC) and - Enzyme Linked Immunosorbent Assay (ELISA) - These examples are taken straight from the package vignette ] ??? 
The package is not actively maintained, so you must install it from the CRAN archive

---

# Calibration

.pull-leftWider[
<img src="data:image/png;base64,#fluid_fig/calibFit_fits-1.svg" width="100%" style="display: block; margin: auto;" />
]

.pull-rightNarrower[
- *Calibration* is *inverse regression*: these fitted curves would be used to map assay responses from samples of unknown concentration (vertical axis) to concentration values (horizontal axis).
- Both fits exhibit *heteroscedasticity*: the error variance is not constant with respect to concentration
- Most models assume *homoscedasticity*, or constant error variance.
]

---

# Residuals (Response - Fitted values)

<img src="data:image/png;base64,#fluid_fig/calibFit_residuals-1.svg" width="100%" style="display: block; margin: auto;" />

---

# Typical Regression

Typically, regression models are of the form:

`\begin{equation} Y_{i}=f(x_i,\beta)+\epsilon_{i}, \end{equation}`

where:

- `\(Y_{i}\)` is the observed response/outcome for the `\(i\)`th individual ( `\(i=1,\ldots,n\)` )
- `\(x_i\)` are covariates/predictors for the `\(i\)`th individual
- `\(\beta\)` are regression coefficients to be estimated
- `\(f(\cdot,\cdot)\)` is the model (assumed "known" or to be estimated)
  - In linear regression `\(f(x_i,\beta)=x_i\beta\)`
- `\(\epsilon_i\)` is the residual error
  - We assume `\(\epsilon\sim\mathcal{N}(0,\sigma^2)\)`
  - `\(\sigma\)` is the *constant* standard deviation (*homoscedastic*)

If the standard deviation is not actually constant (*heteroscedastic*), estimates might be unreliable.
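The linear regression model above can be fit in `R` with `lm`, which computes the ordinary least squares estimates; a minimal sketch on simulated data (all names here are hypothetical):

```r
# Simulate homoscedastic data: y = 2 + 0.5 x + Normal(0, sd = 3) error
set.seed(1)
x <- 1:100
y <- 2 + 0.5 * x + rnorm(100, sd = 3)

fit <- lm(y ~ x)    # ordinary least squares
coef(fit)           # estimates of the intercept and slope (beta)
sum(resid(fit)^2)   # residual sum of squares (RSS), which OLS minimizes
```

`plot(fit)` shows diagnostic plots, including residuals vs fitted values — a quick visual check of the homoscedasticity assumption.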
---

## Ordinary Least Squares: minimizing the sum of squared residuals

<video width="100%" controls loop><source src="fluid_fig/regression-movie.webm" /></video>

`\(^*\)` RSS = Residual sum of squares, or `\(\sum_i (\textrm{Observed}_i-\textrm{Fitted}_i)^2\)`

---

# Modeling Heteroscedastic Errors

The `calibFit` package includes models of the form:

`\begin{equation} Y_{ij}=f(x_i,\beta)+\sigma g(\mu_i,z_i,\theta) \epsilon_{ij}, \end{equation}`

where:

- `\(Y_{ij}\)` is the observed assay value/response for the `\(i\)`th individual ( `\(i=1,\ldots,n\)` ), `\(j\)`th replicate
- `\(g(\mu_i,z_i,\theta)\)` is a function that allows the variances to depend on:
  - `\(\mu_i\)` (the mean response `\(f(x_i,\beta)\)`),
  - covariates `\(z_i\)`, and
  - a parameter ("known" or unknown) `\(\theta\)`.
- `\(\epsilon_{ij}\sim\mathcal{N}(0,1)\)`

In particular, `calibFit` implements the Power of the Mean (POM) function

`\begin{equation} g(\mu_i,\theta) = \mu_i^{\theta} \end{equation}`

which results in

`\begin{equation} \operatorname{var}(Y_{ij}) = \sigma^2\mu_i^{2\theta} \end{equation}`

???

allowing the variance to depend on the mean.

---

# "Homogenized" Residuals From Fits with POM

<img src="data:image/png;base64,#fluid_fig/calibFit_pom_residuals-1.svg" width="100%" style="display: block; margin: auto;" />

---

# HPLC Calibration With/Without POM Variance

<img src="data:image/png;base64,#fluid_fig/calib_hplc_pom-1.svg" width="100%" style="display: block; margin: auto;" />

???
The mean does not change much, but we get more accurate 95% confidence bands

---

# ELISA Calibration With/Without POM Variance

<img src="data:image/png;base64,#fluid_fig/calib_elisa_pom-1.svg" width="100%" style="display: block; margin: auto;" />

---

# Calibrated Estimates for Each Sample

.pull-left[
<img src="data:image/png;base64,#fluid_fig/calibrated1-1.svg" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#fluid_fig/calibrated2-1.svg" width="100%" style="display: block; margin: auto;" />
]

???

* MDC is Minimum Detectable Concentration, which we'll define on the next slide

---

# Calibration Statistics

Assuming the calibration curve `\(f\)`, mapping concentrations to assay responses, is increasing, we define the following terms.

**Minimum Detectable Concentration (MDC)**: The lowest concentration at which the fitted response exceeds the upper confidence limit of the response at zero, or:

`$$x_{\textrm{MDC}} = \min\{x : f(x, \beta) > \textrm{UCL}_0\}$$`

where `\(\textrm{UCL}_0\)` is the upper confidence limit at 0

**Reliable Detection Limit (RDL)**: The lowest concentration that has a high probability of producing a response that is significantly greater than the response at 0, or

`$$x_{\textrm{RDL}} = \min\{x : \textrm{LCL}_x > \textrm{UCL}_0 \}$$`

**Limit of Quantitation (LOQ)**: The lowest concentration at which the coefficient of variation is less than a fixed percent (the default is 20% in the `calibFit` package).
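Since `calibFit` is archived, a POM-type variance model can also be fit with `gls` and `varPower()` from the `nlme` package; a sketch on simulated data (the linear mean model and all names here are illustrative assumptions, not the `calibFit` API):

```r
# Power-of-the-mean (POM) variance via nlme::gls on simulated data:
# varPower() models var(error) = sigma^2 * |fitted|^(2*theta)
library(nlme)
set.seed(2)
conc <- rep(seq(10, 100, by = 10), each = 3)         # known concentrations, 3 replicates
mu   <- 5 + 2 * conc                                 # true mean response
resp <- mu + rnorm(length(mu), sd = 0.2 * sqrt(mu))  # SD grows with the mean (theta = 0.5)

fit <- gls(resp ~ conc, weights = varPower())
coef(fit)                        # mean-model estimates
coef(fit$modelStruct$varStruct)  # estimated power parameter theta
```

The same `weights = varPower()` argument works in `nlme::gnls` for nonlinear mean models such as the four-parameter logistic.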
---

class: inverse, middle, center

# Supervised Learning
## Classification

---

# Classification

.pull-leftWider[
<img src="data:image/png;base64,#fluid_fig/classification-1.svg" width="100%" style="display: block; margin: auto;" />
]

.pull-rightNarrower[
- Data from [adni.loni.usc.edu](adni.loni.usc.edu)
- CSF Abeta 1-42 and t-tau assayed using the automated Roche Elecsys and cobas e 601 immunoassay analyzer system
- Filter time points associated with the first assay, and ignore subsequent time points
- We'll ignore MCI and focus on CN vs Dementia
- Values greater than the upper limit of detection have been assigned the limit
]

---

# Classification

<img src="data:image/png;base64,#fluid_fig/classification_no_mci-1.svg" width="100%" style="display: block; margin: auto;" />

---

# Receiver Operating Characteristic (ROC) Curves

.pull-left[
<img src="data:image/png;base64,#fluid_fig/ROC_abeta-1.svg" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
For each potential threshold applied to CSF `\(\textrm{A}\beta 42\)`, we calculate:

- Sensitivity: True Positive Rate = TP/(TP+FN)
- Specificity: True Negative Rate = TN/(TN+FP)

This traces out the ROC curve. A typical summary of a classifier's performance is the Area Under the Curve (AUC). AUC=0.83 in this case, with 95% CI ( 0.8, 0.86 ). AUCs close to one indicate good performance.

The threshold shown here maximizes the distance between the curve and the diagonal line (chance)
]

???
Sensitivity is a measure of how well we are detecting positive cases.

Specificity is a measure of how well we are detecting controls or negative cases.

Youden's index gives equal weight to false positives and false negatives (not necessarily appropriate)

---

# Comparing ROC Curves

.pull-left[
<img src="data:image/png;base64,#fluid_fig/ROC_abeta_tau-1.svg" width="100%" style="display: block; margin: auto;" />
]

.pull-right[

| Marker | AUC | 95% CI | P-value `\(^*\)` |
| ---------------------- |:----:| -------------:| ------------:|
| `\(\textrm{A}\beta\)` | 0.83 | 0.8, 0.86 | |
| Tau | 0.78 | 0.75, 0.82 | 0.07 |
| Tau/ `\(\textrm{A}\beta\)` | 0.9 | 0.87, 0.92 | <0.001 |

`\(^*\)` Bootstrap test comparing each row to `\(\textrm{A}\beta\)`

So the ratio of Tau / `\(\textrm{A}\beta\)` shows the best discrimination of CN from Dementia cases.
]

---

# Youden's Cutoff for Tau / `\(\textrm{A}\beta\)` Ratio

<img src="data:image/png;base64,#fluid_fig/abeta_tau_scatter_youden-1.svg" width="100%" style="display: block; margin: auto;" />

Line is Tau = 0.394 `\(\times\)` Abeta, depicting Youden's cutoff (maximizes sensitivity + specificity - 1)

???

Youden's cutoff, maximizing sensitivity + specificity - 1, is appropriate if sensitivity and specificity are equally important

---

# Logistic Regression

<table> <thead> <tr> <th style="text-align:left;"> Coefficient </th> <th style="text-align:right;"> Estimate </th> <th style="text-align:right;"> Std.
Error </th> <th style="text-align:right;"> z value </th> <th style="text-align:left;"> Pr(>|z|) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.89 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> -6.7 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> scale(ABETA) </td> <td style="text-align:right;"> -1.59 </td> <td style="text-align:right;"> 0.15 </td> <td style="text-align:right;"> -10.6 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> scale(TAU) </td> <td style="text-align:right;"> 1.26 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 9.0 </td> <td style="text-align:left;"> <0.001 </td> </tr> </tbody> </table>

`$$\log\big(\frac{p}{1-p}\big) = \hat\gamma_0 + A\beta_z \hat\gamma_{A\beta} + \textrm{tau}_z \hat\gamma_{\textrm{tau}}$$`

where `\(\hat\gamma\)` are regression coefficients.

---

# Logistic Regression Predicted Probabilities

<img src="data:image/png;base64,#fluid_fig/logistic_pred_prob-1.svg" width="100%" style="display: block; margin: auto;" />

???

The line again depicts Youden's cutoff

---

# Ratio Contours

<img src="data:image/png;base64,#fluid_fig/ratio_gradient-1.svg" width="100%" style="display: block; margin: auto;" />

???

By using ratios, we're simplifying the bivariate scatter by assuming all dots along these lines intersecting (0,0) are equivalent. The dashed line has slope 1.

---

# Logistic Regression Predicted Probability Contours

<img src="data:image/png;base64,#fluid_fig/ratio_gradient_logistic-1.svg" width="100%" style="display: block; margin: auto;" />

???
In contrast, logistic regression assumes the predicted probability gradient follows these parallel lines. The lines are where the predicted probabilities from logistic regression are constant.

---

# Comparing ROC Curves

.pull-left[
<img src="data:image/png;base64,#fluid_fig/ROC_logistic-1.svg" width="100%" style="display: block; margin: auto;" />
]

.pull-right[

| Marker | AUC | 95% CI | P-value `\(^*\)` |
| ---------------------- |:----:| -------------:| ------------:|
| `\(\textrm{A}\beta\)` | 0.83 | 0.8, 0.86 | |
| Tau | 0.78 | 0.75, 0.82 | 0.07 |
| Tau/ `\(\textrm{A}\beta\)` | 0.9 | 0.87, 0.92 | <0.001 |
| Logistic model | 0.9 | 0.87, 0.92 | <0.001 |

`\(^*\)` Bootstrap test comparing each row to `\(\textrm{A}\beta\)`

The logistic model ROC is very similar to the Tau/ `\(\textrm{A}\beta\)` ratio ROC.
]

---

# Logistic Regression with Age and APOE

<table> <thead> <tr> <th style="text-align:left;"> Coefficient </th> <th style="text-align:right;"> Estimate </th> <th style="text-align:right;"> Std.
Error </th> <th style="text-align:right;"> z value </th> <th style="text-align:left;"> Pr(>|z|) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.12 </td> <td style="text-align:right;"> 0.17 </td> <td style="text-align:right;"> -6.5 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> scale(ABETA) </td> <td style="text-align:right;"> -1.43 </td> <td style="text-align:right;"> 0.16 </td> <td style="text-align:right;"> -9.0 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> scale(TAU) </td> <td style="text-align:right;"> 1.19 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 8.5 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> scale(I(AGE + Years.bl)) </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.12 </td> <td style="text-align:right;"> 1.2 </td> <td style="text-align:left;"> 0.230 </td> </tr> <tr> <td style="text-align:left;"> as.factor(APOE4)1 </td> <td style="text-align:right;"> 0.37 </td> <td style="text-align:right;"> 0.25 </td> <td style="text-align:right;"> 1.5 </td> <td style="text-align:left;"> 0.144 </td> </tr> <tr> <td style="text-align:left;"> as.factor(APOE4)2 </td> <td style="text-align:right;"> 1.26 </td> <td style="text-align:right;"> 0.45 </td> <td style="text-align:right;"> 2.8 </td> <td style="text-align:left;"> 0.005 </td> </tr> </tbody> </table> This model does not provide much better ROC, either. --- # Regression Trees <img src="data:image/png;base64,#fluid_fig/tree1-1.svg" width="100%" style="display: block; margin: auto;" /> ??? Regression trees use recursive partitioning to classify data into more and more homogeneous subgroups --- # Tree-based Methods <img src="data:image/png;base64,#fluid_fig/tree2-1.svg" width="100%" style="display: block; margin: auto;" /> ??? 
With this shallow tree, we end up with these four partitions of the Abeta-by-Tau scatter

---

# Comparing ROC Curves

.pull-left[
<img src="data:image/png;base64,#fluid_fig/ROC_rf-1.svg" width="100%" style="display: block; margin: auto;" />
]

.pull-right[

| Marker | AUC | 95% CI | P-value `\(^*\)` |
| ---------------------- |:----:| -------------:| ------------:|
| `\(\textrm{A}\beta\)` | 0.83 | 0.8, 0.86 | |
| Tau | 0.78 | 0.75, 0.82 | 0.07 |
| Tau/ `\(\textrm{A}\beta\)` | 0.9 | 0.87, 0.92 | <0.001 |
| Logistic model | 0.9 | 0.87, 0.92 | <0.001 |
| Binary Tree | 0.88 | 0.86, 0.91 | <0.001 |
| Random Forest | 0.95 | 0.93, 0.96 | <0.001 |

`\(^*\)` Bootstrap test comparing each row to `\(\textrm{A}\beta\)`

Random Forests re-fit binary trees on random subsamples of the data, then aggregate the resulting trees into a "forest". This results in smoother predictions and a smoother ROC curve.
]

---

class: inverse, middle, center

# Unsupervised Learning
## Mixture Modeling

---

# Unsupervised Learning

.large[
- The classification techniques we just reviewed can be thought of as *Supervised Learning*, in which we attempt to learn known "labels" (CN, Dementia).
- *Mixture Modeling* is a type of *Unsupervised Learning*, in which we try to identify clusters or sub-populations that appear to arise from different distributions
- Don't confuse *Mixture Models* with *Mixed-Effects Models* (which we'll discuss later)
  - Think: "Mixture of Distributions"
]

---

# Distribution of ABETA

<img src="data:image/png;base64,#fluid_fig/density_Abeta-1.svg" width="100%" style="display: block; margin: auto;" />

- Distribution is bimodal
- Can we identify the two sub-distributions?
- We'll explore with the `mixtools` package

---

# Distribution of ABETA

<img src="data:image/png;base64,#fluid_fig/mixture_distribution_Abeta-1.svg" width="100%" style="display: block; margin: auto;" />

???
mixture models provide latent class membership probabilities, such as these --- # Posterior Membership Probabilities <table> <thead> <tr> <th style="text-align:right;"> Abeta </th> <th style="text-align:right;"> Prob. Abnormal </th> <th style="text-align:right;"> Prob. Normal </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1033 </td> <td style="text-align:right;"> 0.58 </td> <td style="text-align:right;"> 0.42 </td> </tr> <tr> <td style="text-align:right;"> 1036 </td> <td style="text-align:right;"> 0.57 </td> <td style="text-align:right;"> 0.43 </td> </tr> <tr> <td style="text-align:right;"> 1044 </td> <td style="text-align:right;"> 0.53 </td> <td style="text-align:right;"> 0.47 </td> </tr> <tr> <td style="text-align:right;"> 1048 </td> <td style="text-align:right;"> 0.52 </td> <td style="text-align:right;"> 0.48 </td> </tr> <tr> <td style="text-align:right;"> 1061 </td> <td style="text-align:right;"> 0.46 </td> <td style="text-align:right;"> 0.54 </td> </tr> <tr> <td style="text-align:right;"> 1071 </td> <td style="text-align:right;"> 0.42 </td> <td style="text-align:right;"> 0.58 </td> </tr> <tr> <td style="text-align:right;"> 1071 </td> <td style="text-align:right;"> 0.42 </td> <td style="text-align:right;"> 0.58 </td> </tr> <tr> <td style="text-align:right;"> 1072 </td> <td style="text-align:right;"> 0.41 </td> <td style="text-align:right;"> 0.59 </td> </tr> </tbody> </table> --- ## Bivariate Density <iframe src="bvdensity_csf_tau.html" width="100%" height="500" id="igraph" scrolling="no" seamless="seamless" frameBorder="0"> </iframe> --- # Bivariate Density Contour Plot <img src="data:image/png;base64,#fluid_fig/bv_kernel_density-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Bivariate Mixture Model Posterior Probabilities .pull-left[ <img src="data:image/png;base64,#fluid_fig/mvmix_post_prob-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img 
src="data:image/png;base64,#fluid_fig/mvmix_density-1.svg" width="100%" style="display: block; margin: auto;" />
]

???

The line on the left is still Youden's cutoff for the ratio. Contours are confidence ellipses at 99%, 95%, and 90%, just to help see the shape of the estimated distributions.

---

# Summary

.large[
- Batch Effects
- Experimental Design (Sample Randomization)
- Statistical Models for Assay Calibration/Quantification
- Classification (Supervised Learning)
  - Logistic Regression
  - Binary Trees
  - Random Forest
- Mixture Modeling (Unsupervised Learning)
  - Univariate
  - Bivariate
]